Abstract

Perceptrons with graded input-output relations and a limited output 
precision are studied within the Gardner-Derrida canonical ensemble 
approach. Soft non-negative error measures are introduced allowing 
for extended retrieval properties. In particular, the performance of 
these systems for a linear and a quadratic error measure, corresponding
to the perceptron and the adaline learning algorithms respectively, is
compared with the performance for a rigid error measure that simply counts
the number of errors. Replica-symmetry-breaking effects are evaluated.



PACS numbers: 87.10.+e; 64.60.Cn

Short title: Graded perceptrons with soft error measures.
(1) e-mail: Desire.Bolle@fys.kuleuven.ac.be

Also at Interdisciplinair Centrum voor Neurale Netwerken, K.U. Leuven, Belgium
(2) e-mail: rubem@if.ufrgs.br

Present address: Instituto de Física, Universidade Federal do Rio Grande do Sul,
Caixa Postal 15051, 91501-970 Porto Alegre, RS, Brazil



1. Introduction 



Graded-response perceptrons constitute the basic building blocks of layered 
architectures trained by the backpropagation algorithm. This motivates the in- 
terest in these systems over the last years. Questions pertaining to retrieval 
properties of specific architectures 0-0, to optimal capacities of networks de- 
signed to perform a given storage task 0-[0] and to generalisation abilities || 
have been addressed by statistical mechanics approaches.

In this paper, we develop a Gardner-Derrida (GD) type analysis of the op- 
timal storage properties for graded-response perceptrons when allowing errors. 
The underlying idea thereby is to view learning in these perceptrons as an op- 
timization process in the space of couplings. By introducing soft non-negative 
error measures we investigate the canonical ensemble generated by the corre- 
sponding cost function in the space of couplings using the replica method. In 
this discussion we allow for a limited output precision in the storage task to be 
solved by the perceptron. In particular, a linear and a quadratic error measure 
are investigated. The corresponding cost functions are of special interest since,
through the method of gradient descent, they define a perceptron learning algorithm
and an adaline learning algorithm, respectively. For comparison we also derive
the results for the rigid GD error measure that simply counts the number of er- 
rors. Replica-symmetric (RS) and first-step replica-symmetry-breaking (RSB) 
solutions for the storage capacity and the average output error are studied. 

For the case of two-state attractor neural networks the canonical ensemble
approach advocated in ref. [9] has been streamlined and extended to cost
functions other than the rigid one [10]. The methods and results obtained there are, of
course, also relevant for perceptron networks. First-step RSB effects above the
critical capacity have then been studied in [11] for binary perceptron networks
with a GD cost function and have been extended to other cost functions [12],
[13]. Recently, it has been shown [14] for the GD cost function that in the
region above the critical capacity full RSB is necessary for an exact solution. A 
direct evaluation of the two-step RSB solution has been performed in this case, 
yielding a minimum storage error only slightly greater than the one-step RSB. 
The conclusion was put forward that for most practical purposes one-step RSB 
will be adequate. 

The rest of this paper is organized as follows. In section 2 we briefly review
the canonical approach adapted to the graded-response perceptron and introduce
the different cost functions we want to consider: the rigid one, the linear one and
the quadratic one. Section 3 contains the replica theory for these cost functions
and determines the critical storage capacity, the distribution of the local fields
and the average output error. Both the RS approximation and the first step RSB
are treated for a general monotonic input-output relation. Section 4 describes
the results of this theory applied to two specific, frequently used input-output
relations, i.e., the hyperbolic tangent and the piecewise linear one. In section 5
the most important results are summarized. Finally, the appendix contains the
technical details of the derivations.

The analysis reported in this work extends our results [7] on the optimal
capacity of graded-response perceptrons in the framework of the Gardner theory
[0- 

2. Canonical ensemble approach 

The task to be solved by the graded-response perceptron is to map a collection
of input patterns $\{\xi_i^\mu;\ 1 \le i \le N\}$, $1 \le \mu \le p$, onto a corresponding set of outputs
$\zeta^\mu$, $1 \le \mu \le p$, via

$$\zeta^\mu = g(\gamma h^\mu) \qquad (1)$$
$$h^\mu = \frac{1}{\sqrt{N}} \sum_{j=1}^{N} J_j \xi_j^\mu . \qquad (2)$$

Here $g$ is the input-output relation of the perceptron, which is assumed to be
a monotonic non-decreasing function. In (1) $\gamma$ denotes a gain parameter, and $h^\mu$
is the local field generated by the inputs $\{\xi_i^\mu\}$ as specified in (2). The $J_j$
are couplings of an architecture of perceptron type. We restrict our attention
to general unbiased input patterns specified by $\langle \xi_i^\mu \rangle = 0$ and $\langle \xi_i^\mu \xi_j^\nu \rangle = \delta_{\mu\nu}\delta_{ij} C$.

Since the effect of $C$ in (1) can be absorbed in the gain parameter we take $C = 1$
in the sequel.

We explicitly allow a limited output precision in the mapping (1). In other
words the output that results when the input layer is in the state $\{\xi_i^\mu\}$ is accepted
if

$$g(\gamma h^\mu) \in I_{\rm out}(\zeta^\mu, \epsilon) \equiv [\zeta^\mu - \epsilon,\ \zeta^\mu + \epsilon], \quad \mu = 1, \ldots, p, \qquad (3)$$

where $\epsilon$ denotes the allowed output-error tolerance.

The strategy of the canonical approach is to require the graded-perceptron
network to go through a learning stage in the space of couplings in order to find,
as the absolute minima of a given cost function $E(\{h^\mu\},\{\zeta^\mu\})$, precisely networks
with the properties (3). This cost function is assumed to be a sum of local terms
for each pattern $\mu$

$$E(\{h^\mu\},\{\zeta^\mu\}) = \sum_{\mu} V(h^\mu, \zeta^\mu) . \qquad (4)$$

The different cost functions that will be studied here can be put into the form

$$V(h^\mu, \zeta^\mu) = W_s\big(\zeta^\mu - \epsilon - g(\gamma h^\mu)\big) + W_s\big(g(\gamma h^\mu) - \zeta^\mu - \epsilon\big) , \qquad (5)$$

where

$$W_s(x) = x^s\, \theta(x) , \qquad (6)$$

and $\theta(x)$ is the Heaviside step function. For $s = 0$ we get the GD cost function,
that simply counts the number of errors, irrespective of their size. Moreover
we consider a linear cost function ($s = 1$), where the errors are weighted proportionally
to their magnitudes, and a quadratic cost function ($s = 2$), where the
errors are weighted proportionally to the square of their magnitudes. The relevance
of this choice becomes clear when applying gradient descent dynamics to eq. (4)
with the result

$$\delta J_j = -\delta\, \frac{\partial E}{\partial J_j} = -\delta\, \frac{s \gamma}{\sqrt{N}} \sum_\mu \xi_j^\mu \Big[ -W_{s-1}\big(\zeta^\mu - \epsilon - g(\gamma h^\mu)\big) + W_{s-1}\big(g(\gamma h^\mu) - \zeta^\mu - \epsilon\big) \Big]\, g'(\gamma h^\mu) . \qquad (7)$$

Taking $s = 1$ and $s = 2$ in this expression, we recover the perceptron
learning algorithm and the adaline learning algorithm, respectively, with step size $\delta$
for the graded perceptron. The GD cost function does not correspond to any
learning algorithm.
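As an illustration of eqs. (5)-(7), the following minimal Python sketch evaluates the local cost for $s = 0, 1, 2$ and performs one batch gradient-descent update for the hyperbolic tangent input-output relation. The function names (`W`, `cost`, `gradient_step`) and the default parameter values are illustrative assumptions, and the spherical constraint on the couplings is deliberately not enforced here.

```python
import math

def W(x, s):
    """W_s(x) = x**s * theta(x) of eq. (6); s = 0 gives the bare step function."""
    if x <= 0.0:
        return 0.0
    return 1.0 if s == 0 else x ** s

def cost(h, zeta, s, gamma=1.0, eps=0.0):
    """Local cost V(h, zeta) of eq. (5), assuming g = tanh."""
    out = math.tanh(gamma * h)
    return W(zeta - eps - out, s) + W(out - zeta - eps, s)

def gradient_step(J, xis, zetas, s, gamma=1.0, eps=0.0, step=0.1):
    """One batch gradient-descent update of the couplings J (s = 1 or 2, g = tanh).

    The spherical normalisation of J is NOT imposed in this sketch.
    """
    if s not in (1, 2):
        raise ValueError("gradient descent is only defined here for s = 1 or s = 2")
    n = len(J)
    new_J = list(J)
    for xi, zeta in zip(xis, zetas):
        h = sum(Jj * x for Jj, x in zip(J, xi)) / math.sqrt(n)   # local field, eq. (2)
        out = math.tanh(gamma * h)
        dg = 1.0 - out * out                                      # tanh'(gamma * h)
        # dV/dh obtained by differentiating eq. (5) with respect to h
        dV_dh = s * gamma * dg * (W(out - zeta - eps, s - 1) - W(zeta - eps - out, s - 1))
        for j in range(n):
            new_J[j] -= step * dV_dh * xi[j] / math.sqrt(n)
    return new_J
```

For a single pattern whose output lies below the target, one such step moves the local field towards the value that reproduces the target and thus lowers the cost.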



3. Replica theory 



The physical properties of the graded-response perceptron network defined
above are derived by investigating the canonical ensemble generated by the free
energy

$$f = -\frac{1}{\beta N} \ln Z , \qquad (8)$$

where $Z$ is the partition function

$$Z = \int \prod_j dJ_j\ \delta\Big(\sum_j J_j^2 - N\Big) \exp\big[-\beta E(\{h^\mu\},\{\zeta^\mu\})\big] . \qquad (9)$$

In (9) the mean spherical constraint $\sum_i J_i^2 = N$ is adopted to fix a scale for the
gain parameter $\gamma$ of the input-output relation. We are interested in the limit
$\beta \to \infty$ in which the free energy gives information about the fraction of patterns
that are stored incorrectly. In the usual way the free energy is assumed to be self-averaging
with respect to the inputs $\{\xi_i^\mu\}$ and the outputs $\{\zeta^\mu\}$. This average,
denoted by $\langle f \rangle$, can be performed by applying the replica trick.

The standard order parameter that appears in such a replica calculation is the
overlap between two distinct replicas in coupling space

$$q_{\lambda\lambda'} = \frac{1}{N} \sum_{i=1}^{N} J_i^{\lambda} J_i^{\lambda'}, \quad \lambda < \lambda', \quad \lambda, \lambda' = 1, \ldots, n. \qquad (10)$$

In the sequel we consider both the replica-symmetric (RS) analysis and the one-step
breaking effects (RSB1). We also suppress the index $\mu$.
In the RS analysis we assume that

$$q_{\lambda\lambda'} = q, \quad \lambda < \lambda' . \qquad (11)$$

The optimal capacity properties of the system are obtained in the limit $\beta \to \infty$,
$q \to 1$, with $\beta(1 - q) = x$ taking a finite value. In this limit, a standard calculation
analogous to the binary perceptron problem [10] leads to the averaged free
energy

$$\langle f \rangle = \mathop{\rm extr}_{x} \Big\{ -\frac{1}{2x} + \alpha \Big\langle \int Dt\ \min_h \big[F_{RS}(h,\zeta,x,t)\big] \Big\rangle_{\{\zeta\}} \Big\} , \qquad (12)$$

with

$$F_{RS}(h,\zeta,x,t) = V(h,\zeta) + \frac{(h-t)^2}{2x} , \qquad (13)$$

and where $Dt = (dt/\sqrt{2\pi}) \exp(-t^2/2)$, $\alpha = p/N$ denotes the storage capacity and
$\langle \ldots \rangle_{\{\zeta\}}$ indicates the average over the distribution of the output patterns.

Let us denote by $h_0(\zeta,x,t)$ the value of $h$ that minimizes $F_{RS}(h,\zeta,x,t)$. For a
given storage capacity $\alpha$ the variable $x$ is given by the saddle point equation
$\partial \langle f \rangle / \partial x = 0$, that can be rewritten in the form

$$\alpha_{RS}^{-1} = \Big\langle \int Dt\ \big(h_0(\zeta,x,t) - t\big)^2 \Big\rangle_{\{\zeta\}} . \qquad (14)$$

We immediately remark that these results are not always stable against RSB.
Following standard considerations [16], [17] the stability condition reads

$$\alpha_{RS} \Big\langle \int Dt\ \Big[ \frac{d}{dt}\big(h_0(\zeta,x,t) - t\big) \Big]^2 \Big\rangle_{\{\zeta\}} < 1 . \qquad (15)$$
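The saddle-point equation (14) can be checked numerically without any closed-form knowledge of $h_0$. The sketch below is a brute-force illustration under stated assumptions: it takes $g = \tanh$, a single fixed output value $\zeta$ (so the average over $\{\zeta\}$ is omitted), a plain grid search for the minimizer of (13), and a trapezoidal rule for the Gaussian integral. The function names and grid sizes are hypothetical choices, not part of the original analysis.

```python
import math

def local_cost(h, zeta, s, gamma, eps):
    """V(h, zeta) of eq. (5) for g = tanh."""
    out = math.tanh(gamma * h)
    def W(x):
        if x <= 0.0:
            return 0.0
        return 1.0 if s == 0 else x ** s
    return W(zeta - eps - out) + W(out - zeta - eps)

def h0_numeric(zeta, x, t, s, gamma, eps, width=8.0, n_h=2001):
    """Grid minimisation of F_RS(h) = V(h, zeta) + (h - t)**2 / (2 x), eq. (13)."""
    best_h, best_F = t, float("inf")
    for k in range(n_h):
        h = t - width + 2.0 * width * k / (n_h - 1)
        F = local_cost(h, zeta, s, gamma, eps) + (h - t) ** 2 / (2.0 * x)
        if F < best_F:
            best_h, best_F = h, F
    return best_h

def inverse_alpha_rs(zeta, x, s, gamma=1.0, eps=0.0, t_max=6.0, n_t=121):
    """Trapezoidal estimate of the right-hand side of eq. (14) for one value of zeta."""
    dt = 2.0 * t_max / (n_t - 1)
    total = 0.0
    for k in range(n_t):
        t = -t_max + k * dt
        weight = 0.5 if k in (0, n_t - 1) else 1.0
        gauss = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
        total += weight * dt * gauss * (h0_numeric(zeta, x, t, s, gamma, eps) - t) ** 2
    return total
```

For large $x$ and $\zeta = 0$ the estimate approaches 1, as expected when every local field is pinned at the value that reproduces the output exactly; for very small $x$ it approaches 0.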



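For the exact mapping task where $\epsilon = 0$, the result found previously for the critical
storage capacity corresponding to the GD cost function is retrieved when we take
the limit $x \to \infty$ in (14)

$$\alpha_c^{-1} = 1 + \big\langle h_\zeta^2 \big\rangle_{\{\zeta\}} \qquad (16)$$

with

$$h_\zeta = \frac{g^{-1}(\zeta)}{\gamma} . \qquad (17)$$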



Similar to binary networks [10], [11], $\alpha_c$ is the same for all cost functions.
Clearly, for $\alpha > \alpha_c$ errors will be introduced that depend both in quantity and in
size on the specific cost function used. An interesting quantity to look at in this
respect is the distribution of local fields since it provides more information on the
deviation of the errors from the correct output $\zeta$. For a given desired output $\zeta$,
it is defined as

$$P(h|\zeta) = \Big\langle \Big\langle \delta\Big(h - \frac{1}{\sqrt{N}} \sum_j J_j \xi_j\Big) \Big\rangle_{\{J\}} \Big\rangle_{\{\xi\}} , \qquad (18)$$

where the thermal average over $J$ is taken subject to the mean spherical constraint
introduced before. Following Kepler and Abbott [18], we find for the graded
perceptron

$$P_{RS}(h|\zeta) = \int Dt\ \delta\big(h - h_0(\zeta,x,t)\big) . \qquad (19)$$

An overall measure of the network performance is given by the average output
error

$$\mathcal{E} = \big\langle \mathcal{E}(\zeta) \big\rangle_{\{\zeta\}} , \qquad (20)$$

where the $\zeta$-dependent output error $\mathcal{E}(\zeta)$ is given by

$$\mathcal{E}(\zeta) = \int dh\ P_{RS}(h|\zeta) \Big[ W_1\big(\zeta - \epsilon - g(\gamma h)\big) + W_1\big(g(\gamma h) - \zeta - \epsilon\big) \Big] . \qquad (21)$$
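Because of the $\delta$-function in (19), the integral over $h$ in (21) reduces to a Gaussian average over $t$ of the error evaluated at $h = h_0(\zeta,x,t)$. The sketch below illustrates this reduction with a simple trapezoidal rule. It assumes the piecewise-linear input-output relation by default and takes the minimizer as a user-supplied callable `h0_of_t`; both names and the quadrature settings are illustrative assumptions.

```python
import math

def g_piecewise(y):
    """Piecewise-linear input-output relation: identity on [-1, 1], saturated outside."""
    return max(-1.0, min(1.0, y))

def output_error(h0_of_t, zeta, gamma=1.0, eps=0.0, g=g_piecewise, t_max=8.0, n_t=4001):
    """zeta-dependent output error of eq. (21), using eq. (19) to replace h by h0(t)."""
    dt = 2.0 * t_max / (n_t - 1)
    total = 0.0
    for k in range(n_t):
        t = -t_max + k * dt
        weight = 0.5 if k in (0, n_t - 1) else 1.0
        gauss = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
        out = g(gamma * h0_of_t(t))
        # W_1 terms of eq. (21): only deviations beyond the tolerance eps count
        err = max(0.0, zeta - eps - out) + max(0.0, out - zeta - eps)
        total += weight * dt * gauss * err
    return total
```

As a usage example, `output_error(lambda t: t, 0.0)` corresponds to unconstrained Gaussian local fields with target $\zeta = 0$, while `output_error(lambda t: 0.0, 0.0)` corresponds to all fields sitting exactly at the correct value and returns zero.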



From the results in the literature on the binary perceptron problem,
and from our former studies on the graded perceptron system, we expect
RSB effects. We therefore improve the RS results by applying the first step of
Parisi's RSB scheme [19], and introduce the following order parameters

$$q_{(\alpha_1,\alpha_2),(\beta_1,\beta_2)} = \begin{cases} q_1 & \text{if } \alpha_1 = \beta_1,\ \alpha_2 \neq \beta_2 \\ q_0 & \text{if } \alpha_1 \neq \beta_1 \end{cases} \qquad (22)$$

where $\alpha_1, \beta_1 = 1, \ldots, n/m$; $\alpha_2, \beta_2 = 1, \ldots, m$ and $1 \le m \le n$. We remark that in
the limit $n \to 0$, $0 \le m \le 1$.
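The block structure behind (22) is easiest to see for integer $n$ and $m$, before the analytic continuation $n \to 0$. The following sketch builds the corresponding overlap matrix as a nested list. The function name is hypothetical, and the diagonal is set to 1 on the assumption that the self-overlap is fixed by the spherical constraint.

```python
def rsb1_overlap_matrix(n, m, q0, q1):
    """One-step RSB overlap matrix for integer n and m (m must divide n).

    Replicas are grouped into n/m blocks of size m: overlap q1 inside a block,
    q0 between different blocks, and 1 on the diagonal (assumed self-overlap).
    """
    if n % m != 0:
        raise ValueError("m must divide n")
    Q = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a == b:
                Q[a][b] = 1.0
            elif a // m == b // m:      # same block index alpha_1 = beta_1
                Q[a][b] = q1
            else:                       # different blocks
                Q[a][b] = q0
    return Q
```

With `m = 1` or `m = n` all off-diagonal entries coincide and the replica-symmetric form (11) is recovered.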



Similar to [12] we find after a standard but tedious calculation that in the
limit $q_1 \to 1^-$, $m \to 0$ and $0 < q_0 < q_1$ with $m/(1 - q_1) = M$ a finite value and
$x = \beta(1 - q_1)$, the free energy averaged with respect to the inputs $\{\xi\}$ and the
outputs $\{\zeta\}$ can be written as

$$\langle f \rangle = \lim_{\beta \to \infty} \max_{x, q_0, M} \Big\{ -\frac{1}{2Mx} \ln\big[1 + M(1 - q_0)\big] - \frac{q_0}{2x\,[1 + M(1 - q_0)]} - \frac{\alpha}{Mx} \Big\langle \int Dt_0\ \ln \Psi(\zeta, x, q_0, M, t_0) \Big\rangle_{\{\zeta\}} \Big\} \qquad (23)$$

with

$$\Psi(\zeta, x, q_0, M, t_0) = \int Dt_1\ \exp\Big\{ -Mx\ \min_h F_{RSB1}(h, \zeta, x, q_0, t_0, t_1) \Big\} \qquad (24)$$

and

$$F_{RSB1}(h, \zeta, x, q_0, t_0, t_1) = V(h, \zeta) + \frac{1}{2x} \Big[ h - t_0\sqrt{q_0} - t_1\sqrt{1 - q_0} \Big]^2 . \qquad (25)$$

For a chosen storage capacity $\alpha$, the variables $x$, $q_0$ and $M$ are given by the saddle
point equations $\partial \langle f \rangle / \partial x = 0$, $\partial \langle f \rangle / \partial q_0 = 0$ and $\partial \langle f \rangle / \partial M = 0$.

The first step RSB distribution for the local fields corresponding to pattern $\zeta$
becomes

$$P_{RSB1}(h|\zeta) = \int Dt_0 \int Dt_1\ \frac{\exp\{ -Mx\, F_{RSB1}(h_0, \zeta, x, q_0, t_0, t_1) \}}{\Psi(\zeta, x, q_0, M, t_0)}\ \delta(h - h_0) , \qquad (26)$$

where $h_0 = h_0(\zeta, x, q_0, t_0, t_1)$ is the value of $h$ that minimizes $F_{RSB1}(h, \zeta, x, q_0, t_0, t_1)$.
The average output error in RSB1 approximation is obtained by replacing the
expression (19) by (26) in (21).

4. Results for specific cost functions



The theory outlined in the last section has been applied to the specific cost
functions defined in (5)-(6). For the input-output relation $g$ we have used both
the hyperbolic tangent and the piecewise-linear function

$$g(x) = \begin{cases} x & \text{if } |x| < 1 \\ {\rm sign}(x) & \text{elsewhere.} \end{cases} \qquad (27)$$

A priori, our aim is not to compare the macroscopic properties of graded perceptrons
for the two different input-output relations since, in general, they are qualitatively
the same. In fact, the results obtained here are complementary. For the
hyperbolic tangent input-output relation the RS solution is found to be stable
over an important range of values for the parameters $\alpha$ and $\gamma$ while in the case
of the piecewise-linear input-output relation the RS solution is always unstable.
However, from a more technical point of view in the case of the hyperbolic
tangent function, the minimization of $F_{RS}$ in the corresponding averaged RS
free energy with respect to the local field $h$ (recall eqs. (12) and (13)) only leads
to an equation defining $t$ as a function of the minimizing value $h_0$ (see the Appendix).
This equation needs to be inverted but, depending on the values of $\gamma^2 x$
and $\zeta$, the inverse function may be multiple-valued and hence a (sometimes very
tedious) Maxwell construction is required in order to make it single-valued. Consequently,
only the RS solution is studied in detail in this case. In contrast,
the piecewise-linear input-output relation permits an explicit calculation of the
minimizing values $h_0(\zeta, x, t)$ and $h_0(\zeta, x, q_0, t_0, t_1)$ of the functions $F_{RS}$ and $F_{RSB1}$
in the corresponding averaged free energies. This, in turn, drastically simplifies
the calculations and both the RS and RSB1 solutions are completely worked out
in this case.

At this point, we remark already that the Maxwell construction in the hyperbolic
tangent case gives a discontinuity in $h_0(t)$ having an effect on the stability of
the RS solution. Similarly, due to the fact that the piecewise-linear input-output
relation is not everywhere differentiable, a gap structure in the distribution of the
local fields emerges, signalling the instability of the RS solution [17]. The effects
of RSB for the cost functions are found to be important.

In the sequel we present the results of our calculations both for the hyperbolic
tangent and the piecewise-linear input-output relations. In order not to interrupt
the line of reasoning we relegate all technical details of the calculations to the
Appendix.

4.1. Hyperbolic tangent input-output relation

In this part we compare the performance of the three cost functions defined in
(5)-(6) by studying their average output error, $\mathcal{E}$ (recall eq. (20)). Our strategy
is to consider a linear ($s = 1$) and a quadratic ($s = 2$) "entirely soft" cost function
versus a "completely rigid" one ($s = 0$). Soft means that we do not fix the
output-error tolerance $\epsilon$, since some outputs might be far away from the correct
output $\zeta$. Entirely soft indicates that we work without tolerance at all by putting
$\epsilon = 0$. For the completely rigid cost function, $\epsilon$ was determined as a function of
the loading capacity $\alpha$, by solving (for $\epsilon$) the optimal capacity for the graded
perceptron in the microcanonical approach (recall eq. (9) of ref. ||). 

The results are presented in Figs. 1 and 2. First, we show in Figs. 1a-c the
loading capacity $\alpha$ as a function of the gain parameter $\gamma$ for a constant average
output error $\mathcal{E} = 0$, 0.1 and 0.2 in the case of the three cost functions. For the
rigid cost function, we plot an additional curve for $\mathcal{E} = 0.4$ to indicate that the
capacity has a maximum for finite $\gamma$, although only for higher values of $\mathcal{E}$. In the
case of both the linear and the quadratic cost functions no maximum is found for
a finite gain parameter. Furthermore the de Almeida-Thouless (AT) line, $\alpha_{AT}$,
is given, indicating that the region of RS breaking (to the right of the line) is
important. The rigid cost function has the worst performance for all values of $\gamma$.
For both the linear and the quadratic cost function a monotonically increasing
(but bounded) capacity $\alpha$ results. For all values of $\gamma$, the linear cost function has
the best performance.

This behaviour of the graded perceptron network can be understood in terms
of the "strategy" used by a specific cost function to arrange the local fields when
learning the patterns. The rigid cost function puts all local fields in a connected
interval, thereby minimizing its width. It does not try to optimize the learning
inside the interval in order to decrease the average output error. However, the
linear and quadratic soft cost functions do optimize their performance by penalizing
the errors linearly and quadratically, respectively, with their size. They try
to arrange the local fields in a close region around the value resulting in the
correct output $\zeta$ under the action of the input-output relation. In both cases
the resulting distribution of the local fields shows a sharp peak (a $\delta$-peak in the
linear case) at $h_\zeta$, and decreasing tails. A gap in between can occur. The tails of
the quadratic cost function decrease faster than those of the linear cost function.
We will present figures below for the case of the piecewise-linear input-output
relation where a similar behaviour has been found.

Finally it is very interesting to discuss in more detail the "gap" structure of
the local fields, revealed by the line $\alpha_g$ in Figs. 2a-b. For the rigid cost function,
no gaps are present, since the output tolerance $\epsilon$ is chosen such that all fields are
inside a connected interval. For the linear and quadratic cost functions Figs. 2a-b
present the relevant results in the $(\alpha, \gamma)$-plane. A gap is present in the region
between the lines $\alpha_c$ (for $\mathcal{E} = 0$) and $\alpha_g$. For $\alpha < \alpha_c$, the perceptron is not
saturated, i.e., $q < 1$, and the present calculations do not cover this region. We
notice that for small $\alpha$ the gap line lies very close to the AT-line. A similar
behaviour has been noticed in binary networks trained with noisy patterns [20].
For growing $\alpha > \alpha_c$, the width of the gaps decreases from an infinite value at $\alpha_c$
to become zero as $\alpha$ approaches $\alpha_g$. In the region between $\alpha_g$ and $\alpha_{AT}$ there are
no gaps, but the RS solution remains unstable.

Concerning the stability of our results with respect to RS breaking, we see
that for the rigid cost function the curves for the capacity as a function of the
gain parameter at constant average output error are "stable" starting from $\gamma = 0$
up to the point where the curves reach their maximum (in agreement with the
earlier results for constant output-error tolerance). For the linear cost function, the
RS curves are stable for small $\gamma$ and not so small $\mathcal{E}$. However, for the quadratic
cost function all the curves for the hyperbolic tangent input-output relation are
RS unstable.

The origin of instability against RS breaking fluctuations is relatively easy to
understand in the region where gaps in the local field distribution are present
[10], [17], [21]. One can argue that it is not possible to pass continuously from
one replica of the system where a specific pattern is learned in one "band" of
the local fields, to another replica where that pattern is learned in another band.
The corresponding solutions are disconnected in the space of replicas, and the
overlap between pairs of replicas cannot be the same for all pairs, contrary to the
RS assumption. In the region where there are no gaps this argument is no longer
valid. Here, one may argue that spreading the local fields over one single but wide
band can also disrupt the space of replicas. Maybe the notion of a critical band
width is relevant here. This could be an interesting subject for further study.



4.2. Piecewise-linear input-output relation

For the piecewise-linear input-output relation we do consider a non-zero
output-error tolerance $\epsilon$, i.e., all the inputs whose corresponding outputs lie inside
the interval $[\zeta - \epsilon, \zeta + \epsilon]$ do not contribute to the average output error. As outlined
before the numerical calculations are easier than those for the hyperbolic tangent,
and the study of the RSB1 solution in some detail becomes feasible. Numerical
results are presented for $\epsilon = 0.5$.

Before passing to these results, it is worth mentioning that the introduction
of a fixed $\epsilon$ allows us to replace the study of the $s = 0$ cost function with a
completely rigid constraint discussed in Section 4.1, by a true GD cost function
((5)-(6) for $s = 0$).

In Fig. 3 we see both the RS and the RSB1 average output error $\mathcal{E}$ as a function
of the loading capacity $\alpha$ for $\gamma = 1$ for the three cost functions considered. As
expected, $\mathcal{E}_{RSB1} > \mathcal{E}_{RS}$ for all $\alpha > \alpha_c$. In the present region of the network
parameters, the linear cost function gives the best performance. According to
the RSB1 results, the least efficient is the quadratic cost function if $\alpha < 0.48$,
and the GD cost function elsewhere.

Figure 4 shows $\alpha$ as a function of $\gamma$ at constant $\mathcal{E}$. For each cost function, the
upper (lower) curve corresponds to the RS (RSB1) result. For all $\gamma$ the highest
capacity is given by the linear cost function and for $\gamma \lesssim 2.5$, the quadratic cost
function gives the lowest capacity.

The reason that the performance of the $s = 2$ quadratic cost function is worse
here is that with a non-zero $\epsilon$, the average output error decreases.
The curves for the hyperbolic tangent input-output relation are all for
$\mathcal{E} \le 0.4$, while for the piecewise-linear input-output relation we have studied the
case $\mathcal{E} = 0.05$. In the latter, we are closer to the critical capacity. From these
calculations one might conclude that the relative performance of the different
cost functions depends also on the amount of errors. In other words, it matters
how far one is beyond the critical capacity, and the quadratic cost function
performs better in the high-$\alpha$ regime.

In order to discuss in more detail the effects of RSB, we have studied the
distribution of the local fields for the three cost functions. As an illustrative
example, we present in Fig. 5a-c the RS and RSB1 distributions for the specific
parameters $\alpha = 3$, $\gamma = 1$, $\epsilon = 0.5$ and $\zeta = 0.6$. In general, the discussion above
concerning the RS field distribution for the hyperbolic tangent input-output relation
remains valid. For the RSB1 distribution, the following has to be remarked.
In the case of the GD and the linear cost function the coefficients of the $\delta$-part
in the RSB1 local field distributions become smaller. To give an idea about this
change we mention that, e.g., for the GD cost function (recall equations (A.5) and
(A.12)) these coefficients are 0.479 at $h = l$ for the RS solution versus 0.306 for
the RSB1 solution. Similarly for the quadratic cost function the maximum in the
distribution decreases. Furthermore for the three cost functions, the continuum
part of the distribution is more populated for the RSB1 than for the RS solution,
and the widths of the gaps are smaller. Finally, RSB effects in the local field
distribution are less pronounced for the quadratic cost function.

5. Concluding remarks 

In this paper we have studied the canonical ensemble approach to the op- 
timal capacity of graded-response perceptrons with a hyperbolic tangent and 
a piecewise-linear input-output relation for three different cost functions: the 
Gardner-Derrida cost function that simply counts the number of errors irrespective
of their sizes, the linear cost function where the errors are weighted proportionally
to their magnitudes and the quadratic cost function where the errors are
weighted proportionally to the square of their magnitudes. Results have been 
obtained for the storage capacity as a function of the gain parameter, for the dis- 
tribution of the local fields and for the average total output error above critical 
capacity in both RS and RSB1 approximation. 

The transition from RS to RSB occurs at the critical storage. RSB1 effects are 
important, especially for the distributions of the local fields. In agreement with 
standard results it is seen that whenever the distribution displays a gap the RS 
saddle point is certainly unstable. But in all cases considered here the instability
persists in regions of the network parameters where no gap occurs (but, of course,
the replicon eigenvalue is still positive). For small loading the gap line lies very
close to the AT-line. The width of the gap itself decreases in RSB1. Already
for a small average total output error (and an output tolerance 0.5) the capacity
is overestimated in RS by typically about 10%. In general, RSB1 effects are the
smallest for the quadratic cost function.

A Theory for specific cost functions



In this appendix we apply the general theory discussed in Section 3 to the
specific cost functions (5)-(6). In particular we study the RS solutions for the
hyperbolic tangent input-output relation and both the RS and RSB1 solutions
for the piecewise-linear input-output relation defined in (27).



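A.1 GD cost function



In the case of the GD cost function with output tolerance $\epsilon$ the results presented
here are valid, of course, for both input-output relations considered, by
taking in the end the relevant expression for $g^{-1}(\zeta - \epsilon)$. In the case of the RS
treatment, the minimum in $h$ of Eq. (13) is given by

$$\begin{array}{lll}
h_0 = t, & F_{RS}(h_0,\zeta,x,t) = 1 & \text{for } -\infty < t < l - \sqrt{2x} \\
h_0 = l, & F_{RS}(h_0,\zeta,x,t) = \dfrac{(l-t)^2}{2x} & \text{for } l - \sqrt{2x} < t < l \\
h_0 = t, & F_{RS}(h_0,\zeta,x,t) = 0 & \text{for } l < t < u \\
h_0 = u, & F_{RS}(h_0,\zeta,x,t) = \dfrac{(u-t)^2}{2x} & \text{for } u < t < u + \sqrt{2x} \\
h_0 = t, & F_{RS}(h_0,\zeta,x,t) = 1 & \text{for } u + \sqrt{2x} < t < \infty
\end{array} \qquad (A.1)$$

where

$$l = \begin{cases} \dfrac{1}{\gamma}\, g^{-1}(\zeta - \epsilon) & \text{if } \zeta - \epsilon > -1 \\ -\infty & \text{elsewhere} \end{cases} \qquad (A.2)$$

and

$$u = \begin{cases} \dfrac{1}{\gamma}\, g^{-1}(\zeta + \epsilon) & \text{if } \zeta + \epsilon < 1 \\ \infty & \text{elsewhere.} \end{cases} \qquad (A.3)$$

From (14) and (A.1), we obtain the saddle-point equation

$$\alpha_{RS}^{-1} = \Big\langle \int_{l - \sqrt{2x}}^{l} Dt\ (t - l)^2 + \int_{u}^{u + \sqrt{2x}} Dt\ (t - u)^2 \Big\rangle_{\{\zeta\}} . \qquad (A.4)$$

Combining (19) and (A.1), the distribution of local fields becomes

$$P(h|\zeta) = \frac{e^{-h^2/2}}{\sqrt{2\pi}} \Big[ \theta(l - \sqrt{2x} - h) + \theta(h - l) - \theta(h - u) + \theta(h - u - \sqrt{2x}) \Big] + \delta(h - l) \int_{l - \sqrt{2x}}^{l} Dt + \delta(h - u) \int_{u}^{u + \sqrt{2x}} Dt . \qquad (A.5)$$

The RS output error is obtained from (21) and (A.5):

$$\mathcal{E}(\zeta) = \gamma \Big[ \Big(l + \frac{1}{\gamma}\Big) \int_{-\infty}^{h_1} Dh + \int_{h_1}^{l - \sqrt{2x}} Dh\ (l - h) + \int_{u + \sqrt{2x}}^{h_2} Dh\ (h - u) + \Big(\frac{1}{\gamma} - u\Big) \int_{h_2}^{\infty} Dh \Big] , \qquad (A.6)$$

where

$$h_1 = \min\Big( l - \sqrt{2x},\ -\frac{1}{\gamma} \Big) \qquad (A.7)$$

and

$$h_2 = \max\Big( u + \sqrt{2x},\ \frac{1}{\gamma} \Big) . \qquad (A.8)$$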
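The piecewise solution (A.1) and the saddle-point integrand (A.4) lend themselves to a direct numerical check. The sketch below, written for a single value of $\zeta$ (the average over $\{\zeta\}$ is left out), implements $h_0$ from (A.1) and evaluates the two Gaussian integrals in (A.4) in closed form through the error function. The helper names (`phi`, `Phi`, `gauss_second_moment`, `h0_gd`, `inverse_alpha_gd`) are illustrative assumptions, not notation from the paper.

```python
import math

def phi(t):
    """Standard Gaussian density, i.e. the weight in Dt."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard Gaussian cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def h0_gd(t, l, u, x):
    """Minimiser of F_RS for the GD cost function, eq. (A.1)."""
    r = math.sqrt(2.0 * x)
    if l - r < t < l:
        return l
    if u < t < u + r:
        return u
    return t

def gauss_second_moment(a, b, c):
    """Closed form of the integral of (t - c)**2 with measure Dt over [a, b]."""
    return (1.0 + c * c) * (Phi(b) - Phi(a)) + (2.0 * c - b) * phi(b) - (2.0 * c - a) * phi(a)

def inverse_alpha_gd(l, u, x):
    """Integrand of eq. (A.4) for one zeta; infinite l or u contribute nothing."""
    r = math.sqrt(2.0 * x)
    total = 0.0
    if not math.isinf(l):
        total += gauss_second_moment(l - r, l, l)
    if not math.isinf(u):
        total += gauss_second_moment(u, u + r, u)
    return total
```

For $l = u = 0$ and very large $x$ the function returns 1, in line with the exact-mapping limit discussed around (16).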



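For the RSB1 solution we get $h_0(\zeta, x, q_0, t_0, t_1)$ and $F_{RSB1}(h_0, \zeta, x, q_0, t_0, t_1)$
from $h_0(\zeta, x, t)$ and $F_{RS}(h_0, \zeta, x, t)$, respectively, by substituting $t$ by $t_0\sqrt{q_0} +
t_1\sqrt{1 - q_0}$ in (A.1). The function $\Psi(\zeta, x, q_0, M, t_0)$ in (24) becomes

$$\Psi(\zeta, x, q_0, M, t_0) = e^{-Mx} \int_{-\infty}^{\Omega(l - \sqrt{2x}, q_0, t_0)} Dt_1 + \int_{\Omega(l - \sqrt{2x}, q_0, t_0)}^{\Omega(l, q_0, t_0)} Dt_1\ \Phi(l, M, q_0, t_0, t_1) + \int_{\Omega(l, q_0, t_0)}^{\Omega(u, q_0, t_0)} Dt_1 + \int_{\Omega(u, q_0, t_0)}^{\Omega(u + \sqrt{2x}, q_0, t_0)} Dt_1\ \Phi(u, M, q_0, t_0, t_1) + e^{-Mx} \int_{\Omega(u + \sqrt{2x}, q_0, t_0)}^{\infty} Dt_1 \qquad (A.9)$$

where

$$\Omega(h, q_0, t_0) = \frac{h - t_0\sqrt{q_0}}{\sqrt{1 - q_0}} \qquad (A.10)$$

and

$$\Phi(h, M, q_0, t_0, t_1) = \exp\Big\{ -\frac{M}{2} (1 - q_0) \big[ \Omega(h, q_0, t_0) - t_1 \big]^2 \Big\} . \qquad (A.11)$$

The averaged free energy is obtained by plugging this expression into (23). Expression
(26) then leads to the $\zeta$-dependent distribution of local fields

$$P(h|\zeta) = \int Dt_0\ \frac{1}{\Psi(\zeta, x, q_0, M, t_0)} \Big\{ \frac{\exp\big[ -\frac{1}{2} \Omega^2(h, q_0, t_0) \big]}{\sqrt{2\pi(1 - q_0)}} \Big[ e^{-Mx}\, \theta(l - \sqrt{2x} - h) + \theta(h - l) - \theta(h - u) + e^{-Mx}\, \theta(h - u - \sqrt{2x}) \Big] + \delta(h - l) \int_{\Omega(l - \sqrt{2x}, q_0, t_0)}^{\Omega(l, q_0, t_0)} Dt_1\ \Phi(l, M, q_0, t_0, t_1) + \delta(h - u) \int_{\Omega(u, q_0, t_0)}^{\Omega(u + \sqrt{2x}, q_0, t_0)} Dt_1\ \Phi(u, M, q_0, t_0, t_1) \Big\} . \qquad (A.12)$$

Finally, the $\zeta$-dependent RSB1 output error is given by combining (26) and (21):

$$\mathcal{E}(\zeta) = \int Dt_0\ \frac{\gamma\, e^{-Mx}}{\Psi(\zeta, x, q_0, M, t_0) \sqrt{2\pi(1 - q_0)}} \Big[ \Big(l + \frac{1}{\gamma}\Big) \int_{-\infty}^{h_1} dh + \int_{h_1}^{l - \sqrt{2x}} dh\ (l - h) + \int_{u + \sqrt{2x}}^{h_2} dh\ (h - u) + \Big(\frac{1}{\gamma} - u\Big) \int_{h_2}^{\infty} dh \Big] \exp\Big[ -\frac{1}{2} \Omega^2(h, q_0, t_0) \Big] \qquad (A.13)$$

with $h_1$ and $h_2$ defined in (A.7) and (A.8) respectively.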

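A.2 Linear cost function

Let us start by considering the RS approximation first. For the piecewise-linear
input-output relation the minimization in $h$ of Eq. (13) can be done explicitly
leading to the following result

$$\begin{array}{lll}
h_0 = t, & F_{RS}(h_0,\zeta,x,t) = \gamma \big( l + \frac{1}{\gamma} \big) & \text{for } -\infty < t < h_1 \\
h_0 = \gamma x + t, & F_{RS}(h_0,\zeta,x,t) = \gamma \big( l - \frac{\gamma x}{2} - t \big) & \text{for } h_1 < t < h_2 \\
h_0 = l, & F_{RS}(h_0,\zeta,x,t) = \dfrac{(l-t)^2}{2x} & \text{for } h_2 < t < l \\
h_0 = t, & F_{RS}(h_0,\zeta,x,t) = 0 & \text{for } l < t < u \\
h_0 = u, & F_{RS}(h_0,\zeta,x,t) = \dfrac{(u-t)^2}{2x} & \text{for } u < t < h_3 \\
h_0 = -\gamma x + t, & F_{RS}(h_0,\zeta,x,t) = \gamma \big( t - u - \frac{\gamma x}{2} \big) & \text{for } h_3 < t < h_4 \\
h_0 = t, & F_{RS}(h_0,\zeta,x,t) = \gamma \big( \frac{1}{\gamma} - u \big) & \text{for } h_4 < t < \infty
\end{array} \qquad (A.14)$$

where $l$ and $u$ are again given by the formulas (A.2) and (A.3). The variables $h_1$,
$h_2$, $h_3$ and $h_4$ are defined as follows:

$$h_1 = \begin{cases} l - \sqrt{2\gamma x \big( l + \frac{1}{\gamma} \big)} & \text{if } l < -\frac{1}{\gamma} + \frac{\gamma x}{2} \\ -\frac{1}{\gamma} - \frac{\gamma x}{2} & \text{elsewhere} \end{cases} \qquad (A.15)$$

$$h_2 = \begin{cases} l - \sqrt{2\gamma x \big( l + \frac{1}{\gamma} \big)} & \text{if } l < -\frac{1}{\gamma} + \frac{\gamma x}{2} \\ l - \gamma x & \text{elsewhere} \end{cases} \qquad (A.16)$$

$$h_3 = \begin{cases} u + \sqrt{2\gamma x \big( \frac{1}{\gamma} - u \big)} & \text{if } u > \frac{1}{\gamma} - \frac{\gamma x}{2} \\ u + \gamma x & \text{elsewhere} \end{cases} \qquad (A.17)$$

$$h_4 = \begin{cases} u + \sqrt{2\gamma x \big( \frac{1}{\gamma} - u \big)} & \text{if } u > \frac{1}{\gamma} - \frac{\gamma x}{2} \\ \frac{1}{\gamma} + \frac{\gamma x}{2} & \text{elsewhere.} \end{cases} \qquad (A.18)$$

The RS saddle-point equation is obtained from (14) and (A.14):

$$\alpha_{RS}^{-1} = \Big\langle \gamma^2 x^2 \int_{h_1}^{h_2} Dt + \int_{h_2}^{l} Dt\ (t - l)^2 + \int_{u}^{h_3} Dt\ (t - u)^2 + \gamma^2 x^2 \int_{h_3}^{h_4} Dt \Big\rangle_{\{\zeta\}} . \qquad (A.19)$$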
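The case distinctions in (A.14)-(A.18) can be verified against a brute-force minimization of (13). The following sketch encodes the thresholds and the minimizer for the linear cost function with the piecewise-linear input-output relation. It assumes finite $l$ and $u$ with $-1/\gamma \le l \le u \le 1/\gamma$, and the function names are hypothetical.

```python
import math

def thresholds(l, u, x, gamma):
    """Boundaries h1 <= h2 <= l <= u <= h3 <= h4 of eqs. (A.15)-(A.18).

    Assumes finite l and u with -1/gamma <= l <= u <= 1/gamma.
    """
    gx = gamma * x
    if l < -1.0 / gamma + gx / 2.0:
        # the sloped branch h0 = t + gamma*x never wins: h1 and h2 coincide
        h1 = h2 = l - math.sqrt(2.0 * gx * (l + 1.0 / gamma))
    else:
        h1, h2 = -1.0 / gamma - gx / 2.0, l - gx
    if u > 1.0 / gamma - gx / 2.0:
        h3 = h4 = u + math.sqrt(2.0 * gx * (1.0 / gamma - u))
    else:
        h3, h4 = u + gx, 1.0 / gamma + gx / 2.0
    return h1, h2, h3, h4

def h0_linear(t, l, u, x, gamma):
    """Minimiser of F_RS for the linear cost function, eq. (A.14)."""
    h1, h2, h3, h4 = thresholds(l, u, x, gamma)
    gx = gamma * x
    if t < h1:
        return t            # saturated output, constant penalty
    if t < h2:
        return t + gx       # field pulled towards l by the linear slope
    if t < l:
        return l            # field pinned at the lower edge
    if t <= u:
        return t            # inside the tolerance window, no cost
    if t < h3:
        return u            # field pinned at the upper edge
    if t < h4:
        return t - gx       # field pulled towards u
    return t
```

For example, with $\gamma = 1$, $x = 0.5$, $l = -0.2$ and $u = 0.3$ the thresholds evaluate to $h_1 = -1.25$, $h_2 = -0.7$, $h_3 = 0.8$ and $h_4 = 1.25$.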

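Using (19) and (A.14), the RS $\zeta$-dependent distribution of local fields becomes

$$P(h|\zeta) = \frac{\exp(-h^2/2)}{\sqrt{2\pi}} \Big[ \theta(h_1 - h) + \theta(h - l) - \theta(h - u) + \theta(h - h_4) \Big] + \frac{\exp\big[ -(h - \gamma x)^2/2 \big]}{\sqrt{2\pi}} \Big[ \theta(h - h_2') - \theta(h - l) \Big] + \delta(h - l) \int_{h_2}^{l} Dt + \delta(h - u) \int_{u}^{h_3} Dt + \frac{\exp\big[ -(h + \gamma x)^2/2 \big]}{\sqrt{2\pi}} \Big[ \theta(h - u) - \theta(h - h_3') \Big] \qquad (A.20)$$

where

$$h_2' = \begin{cases} l & \text{if } l < -\frac{1}{\gamma} + \frac{\gamma x}{2} \\ -\frac{1}{\gamma} + \frac{\gamma x}{2} & \text{elsewhere} \end{cases} \qquad (A.21)$$

and

$$h_3' = \begin{cases} u & \text{if } u > \frac{1}{\gamma} - \frac{\gamma x}{2} \\ \frac{1}{\gamma} - \frac{\gamma x}{2} & \text{elsewhere.} \end{cases} \qquad (A.22)$$

The RS average output error is obtained from (21) and (A.20):

$$\mathcal{E}(\zeta) = \gamma \Big[ \Big(l + \frac{1}{\gamma}\Big) \int_{-\infty}^{h_1} Dh + \int_{h_2'}^{l} \frac{dh}{\sqrt{2\pi}} \exp\Big[ -\frac{(h - \gamma x)^2}{2} \Big] (l - h) + \int_{u}^{h_3'} \frac{dh}{\sqrt{2\pi}} \exp\Big[ -\frac{(h + \gamma x)^2}{2} \Big] (h - u) + \Big(\frac{1}{\gamma} - u\Big) \int_{h_4}^{\infty} Dh \Big] . \qquad (A.23)$$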
For the hyperbolic tangent input-output relation, $h_\zeta$ given by (17) is always
a local minimum of (13). Other local minima of (13) are defined as solutions of

$$\Big( \frac{\partial F_{RS}(h, \zeta, x, t)}{\partial h} \Big)_{h = h_0} = 0 \qquad (A.24)$$

and they can no longer be determined analytically. The equation (A.24) defines
$t$ as a function of $h_0$,

$$t(h_0) = h_0 + \gamma x\, g'(\gamma h_0)\ {\rm sgn}\big[ g(\gamma h_0) - \zeta \big] \qquad (A.25)$$

that needs to be inverted in order to find $h_0 = h_0(t)$ (the prime denotes the
derivative with respect to the argument). Depending on the value of $\gamma^2 x$, $t(h_0)$ is a monotonic
function or not, and consequently it is invertible or not. The onset of non-monotonicity
is given by the system of equations

$$\frac{dt}{dh_0} = 0 , \qquad \frac{d^2 t}{dh_0^2} = 0 . \qquad (A.26)$$

If monotonicity holds, $h_0$ is a solution of (A.25) for $t < t_\zeta^-$ or $t > t_\zeta^+$, where
$t_\zeta^\pm = h_\zeta \pm \gamma x\, g'(\gamma h_\zeta)$. If non-monotonicity holds, $h_0(t)$ has one or two jumps at
$t = t_1$ and/or $t = t_2$, whereby we assume that $t_1 < t_2$. The values of $t_1$ and $t_2$ are
then determined using a Maxwell construction in the function $t(h_0)$. The number
of jumps depends on the value of $\gamma^2 x$ and $\zeta$.
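Whether $t(h_0)$ in (A.25) is monotonic can be probed numerically before attempting any inversion. The sketch below is a rough illustration assuming the linear cost function with $\epsilon = 0$ and $g = \tanh$, so that $g'(y) = 1 - \tanh^2 y$. The function names, the scan range and the grid size are arbitrary choices.

```python
import math

def t_of_h(h, zeta, x, gamma):
    """t(h0) of eq. (A.25) for g = tanh, with g'(y) = 1 - tanh(y)**2."""
    T = math.tanh(gamma * h)
    sign = 1.0 if T > zeta else -1.0
    return h + gamma * x * (1.0 - T * T) * sign

def is_monotonic(zeta, x, gamma, h_max=10.0, n=20000):
    """Scan both branches (h < h_zeta and h > h_zeta) for a decrease of t(h0)."""
    h_zeta = math.atanh(zeta) / gamma
    # right branch: h runs from just above h_zeta up to h_max
    prev = None
    for k in range(1, n + 1):
        h = h_zeta + (h_max - h_zeta) * k / n
        cur = t_of_h(h, zeta, x, gamma)
        if prev is not None and cur < prev:
            return False
        prev = cur
    # left branch: h runs from -h_max up to just below h_zeta
    prev = None
    for k in range(n):
        h = -h_max + (h_zeta + h_max) * k / n
        cur = t_of_h(h, zeta, x, gamma)
        if prev is not None and cur < prev:
            return False
        prev = cur
    return True
```

Under these assumptions and for $\zeta = 0$, differentiating (A.25) gives $dt/dh_0 = 1 - 2\gamma^2 x\, T(1 - T^2)$ with $T = \tanh(\gamma h_0)$ on the right branch, so a decrease first appears once $\gamma^2 x$ exceeds $3\sqrt{3}/4 \approx 1.30$; the scan reproduces this on either side of the threshold.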

From (19) and from the inversion of (A.25), we obtain the following expression
for the distribution of the local fields:

$$P(h|\zeta) = \frac{dt}{dh}\ \frac{\exp\big[ -t^2(h)/2 \big]}{\sqrt{2\pi}} + \delta(h - h_\zeta) \int_{t_1}^{t_2} Dt . \qquad (A.27)$$

If monotonicity holds, then $t_1 = t_\zeta^-$ and $t_2 = t_\zeta^+$. Due to the fact that $h_\zeta$ is a
global minimum of (13) in the interval $t_1 < t < t_2$, $t(h_0)$ always displays a jump
at $h = h_\zeta$. This jump gives rise to the second term in the r.h.s. of (A.27). If non-monotonicity
holds, $t(h_0)$ shows plateaus at $t_1$ and $t_2$, leading to a gap structure
in the distribution of the local fields. The resulting discontinuity in $dh_0/dt$ causes
a divergence of the l.h.s. of (15). This means that when non-monotonicity holds,
the RS solution is always unstable. In the case of monotonicity, the stability
condition reads

$$\alpha_{RS} \Big\langle \int_{-\infty}^{t_\zeta^-} Dt\ \Big[ \frac{\gamma^2 x\, g''(\gamma h_0)}{\gamma^2 x\, g''(\gamma h_0) - 1} \Big]^2 + \int_{t_\zeta^-}^{t_\zeta^+} Dt + \int_{t_\zeta^+}^{\infty} Dt\ \Big[ \frac{\gamma^2 x\, g''(\gamma h_0)}{\gamma^2 x\, g''(\gamma h_0) + 1} \Big]^2 \Big\rangle_{\{\zeta\}} < 1 . \qquad (A.28)$$

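Next we consider first-step RSB for the piecewise-linear input-output relation. Similarly to the GD cost function we substitute $t$ by $t_0\sqrt{q_0} + t_1\sqrt{1-q_0}$ in (A.14) in order to obtain $h_0(\zeta,x,q_0,t_0,t_1)$ and $F_{RSB1}(h_0,\zeta,x,q_0,t_0,t_1)$. The free energy is then obtained with the function $\Psi(\zeta,x,q_0,M,t_0)$ (recall eq. (24)) given by

$$
\begin{aligned}
\Psi(\zeta,x,q_0,M,t_0) ={}& \exp\left[-M\gamma x\left(l+\frac{1}{\gamma}\right)\right]\int_{-\infty}^{\Omega(h_1,q_0,t_0)} Dt_1 \\
&+ \exp\left[-M\gamma x\left(l-\frac{\gamma x}{2}-t_0\sqrt{q_0}\right)\right]\int_{\Omega(h_1,q_0,t_0)}^{\Omega(h_2,q_0,t_0)} Dt_1 \exp\left[M\gamma x\, t_1\sqrt{1-q_0}\right] \\
&+ \int_{\Omega(h_2,q_0,t_0)}^{\Omega(l,q_0,t_0)} Dt_1\, \Phi(l,M,q_0,t_0,t_1) + \int_{\Omega(l,q_0,t_0)}^{\Omega(u,q_0,t_0)} Dt_1 \\
&+ \int_{\Omega(u,q_0,t_0)}^{\Omega(h_3,q_0,t_0)} Dt_1\, \Phi(u,M,q_0,t_0,t_1) \\
&+ \exp\left[-M\gamma x\left(t_0\sqrt{q_0}-u-\frac{\gamma x}{2}\right)\right]\int_{\Omega(h_3,q_0,t_0)}^{\Omega(h_4,q_0,t_0)} Dt_1 \exp\left[-M\gamma x\, t_1\sqrt{1-q_0}\right] \\
&+ \exp\left[-M\gamma x\left(\frac{1}{\gamma}-u\right)\right]\int_{\Omega(h_4,q_0,t_0)}^{\infty} Dt_1
\end{aligned}
\tag{A.29}
$$

The RSB1 $\zeta$-dependent distribution of the local fields (26) becomes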



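$$
\begin{aligned}
P(h|\zeta) = \int Dt_0\, \frac{1}{\Psi(\zeta,x,q_0,M,t_0)} \Bigg\{ &\frac{\exp\left[-M\gamma x\left(l+\frac{1}{\gamma}\right)-\frac{1}{2}\Omega^2(h,q_0,t_0)\right]}{\sqrt{2\pi(1-q_0)}}\,\theta(h_1-h) \\
&+ \frac{\exp\left[-M\gamma x\left(l+\frac{\gamma x}{2}-h\right)-\frac{1}{2}\Omega^2(h-\gamma x,q_0,t_0)\right]}{\sqrt{2\pi(1-q_0)}}\,\left[\theta(h-h_2')-\theta(h-l)\right]\theta\!\left(l+\frac{1}{\gamma}-\frac{\gamma x}{2}\right) \\
&+ \delta(h-l)\int_{\Omega(h_2,q_0,t_0)}^{\Omega(l,q_0,t_0)} Dt_1\,\Phi(l,M,q_0,t_0,t_1) \\
&+ \frac{\exp\left[-\frac{1}{2}\Omega^2(h,q_0,t_0)\right]}{\sqrt{2\pi(1-q_0)}}\,\left[\theta(h-l)-\theta(h-u)\right] \\
&+ \delta(h-u)\int_{\Omega(u,q_0,t_0)}^{\Omega(h_3,q_0,t_0)} Dt_1\,\Phi(u,M,q_0,t_0,t_1) \\
&+ \frac{\exp\left[-M\gamma x\left(h-u+\frac{\gamma x}{2}\right)-\frac{1}{2}\Omega^2(h+\gamma x,q_0,t_0)\right]}{\sqrt{2\pi(1-q_0)}}\,\left[\theta(h-u)-\theta(h-h_3')\right]\theta\!\left(\frac{1}{\gamma}-\frac{\gamma x}{2}-u\right) \\
&+ \frac{\exp\left[-M\gamma x\left(\frac{1}{\gamma}-u\right)-\frac{1}{2}\Omega^2(h,q_0,t_0)\right]}{\sqrt{2\pi(1-q_0)}}\,\theta(h-h_4)\Bigg\}
\end{aligned}
\tag{A.30}
$$

Finally, the $\zeta$-dependent RSB1 average output error reads

$$
\begin{aligned}
E(\zeta) = \int Dt_0\, \frac{\gamma}{\Psi(\zeta,x,q_0,M,t_0)\sqrt{2\pi(1-q_0)}} \times \Bigg\{ &\left(l+\frac{1}{\gamma}\right)\exp\left[-M\gamma x\left(l+\frac{1}{\gamma}\right)\right]\int_{-\infty}^{h_1} dh\, \exp\left[-\frac{1}{2}\Omega^2(h,q_0,t_0)\right] \\
&+ \theta\!\left(l+\frac{1}{\gamma}-\frac{\gamma x}{2}\right)\int_{h_2'}^{l} dh\,(l-h)\exp\left[-\frac{1}{2}\Omega^2(h-\gamma x,q_0,t_0)-M\gamma x\left(l+\frac{\gamma x}{2}-h\right)\right] \\
&+ \theta\!\left(\frac{1}{\gamma}-\frac{\gamma x}{2}-u\right)\int_{u}^{h_3'} dh\,(h-u)\exp\left[-\frac{1}{2}\Omega^2(h+\gamma x,q_0,t_0)-M\gamma x\left(h-u+\frac{\gamma x}{2}\right)\right] \\
&+ \left(\frac{1}{\gamma}-u\right)\exp\left[-M\gamma x\left(\frac{1}{\gamma}-u\right)\right]\int_{h_4}^{\infty} dh\, \exp\left[-\frac{1}{2}\Omega^2(h,q_0,t_0)\right]\Bigg\}
\end{aligned}
\tag{A.31}
$$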



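A.3 Quadratic cost function

Again we look at the RS treatment first. We start by defining

$$
h_1 = l - \sqrt{1+2\gamma^2 x}\left(l+\frac{1}{\gamma}\right)
\tag{A.32}
$$

$$
h_2 = l - \frac{1}{\sqrt{1+2\gamma^2 x}}\left(l+\frac{1}{\gamma}\right)
\tag{A.33}
$$

$$
h_3 = u + \frac{1}{\sqrt{1+2\gamma^2 x}}\left(\frac{1}{\gamma}-u\right)
\tag{A.34}
$$

$$
h_4 = u + \sqrt{1+2\gamma^2 x}\left(\frac{1}{\gamma}-u\right)
\tag{A.35}
$$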



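Using these definitions, the result of the minimization in $h$ of $F_{RS}$ (Eq. (13)) becomes

$$
\begin{array}{lll}
F_{RS}(h_0,\zeta,x,t) = \gamma^2\left(l+\dfrac{1}{\gamma}\right)^2, & h_0 = t, & \text{for } -\infty < t < h_1 \\[2mm]
F_{RS}(h_0,\zeta,x,t) = \dfrac{\gamma^2 (l-t)^2}{1+2\gamma^2 x}, & h_0 = \dfrac{2\gamma^2 x\, l + t}{1+2\gamma^2 x}, & \text{for } h_1 < t < l \\[2mm]
F_{RS}(h_0,\zeta,x,t) = 0, & h_0 = t, & \text{for } l < t < u \\[2mm]
F_{RS}(h_0,\zeta,x,t) = \dfrac{\gamma^2 (t-u)^2}{1+2\gamma^2 x}, & h_0 = \dfrac{2\gamma^2 x\, u + t}{1+2\gamma^2 x}, & \text{for } u < t < h_4 \\[2mm]
F_{RS}(h_0,\zeta,x,t) = \gamma^2\left(\dfrac{1}{\gamma}-u\right)^2, & h_0 = t, & \text{for } h_4 < t < \infty
\end{array}
\tag{A.36}
$$

With (A.36), the RS saddle-point equation becomes

$$
1 = \alpha \left\langle \left(\frac{2\gamma^2 x}{1+2\gamma^2 x}\right)^2 \left[\int_{h_1}^{l} Dt\,(l-t)^2 + \int_{u}^{h_4} Dt\,(t-u)^2\right] \right\rangle_{\zeta}
\tag{A.37}
$$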
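The piecewise form of the minimizer $h_0(t)$ can be checked against a brute-force minimization. The sketch below is an illustration only and rests on stated assumptions: the minimized function is taken to be $V(h) + (h-t)^2/(2x)$ with the quadratic measure $V$ equal to the squared output distance to the tolerance band $[\gamma l, \gamma u]$, the piecewise-linear relation is $g(y) = \max(-1,\min(1,y))$, and $-1/\gamma < l \le u < 1/\gamma$. All function names and parameter values are hypothetical.

```python
import math

def quadratic_cost(h, gamma, l, u):
    """Assumed quadratic error measure: squared output distance to [gamma*l, gamma*u]."""
    out = max(-1.0, min(1.0, gamma * h))  # piecewise-linear input-output relation
    if out < gamma * l:
        return (gamma * l - out) ** 2
    if out > gamma * u:
        return (out - gamma * u) ** 2
    return 0.0

def h0_closed_form(t, gamma, x, l, u):
    """Piecewise minimizer h0(t) with thresholds h1 and h4 as in (A.32), (A.35)."""
    a = 1.0 + 2.0 * gamma**2 * x
    h1 = l - math.sqrt(a) * (l + 1.0 / gamma)
    h4 = u + math.sqrt(a) * (1.0 / gamma - u)
    if h1 < t < l:
        return (2.0 * gamma**2 * x * l + t) / a
    if u < t < h4:
        return (2.0 * gamma**2 * x * u + t) / a
    return t

def h0_brute_force(t, gamma, x, l, u, span=6.0, n=24001):
    """Grid search for the global minimum of V(h) + (h - t)^2 / (2x)."""
    best_h, best_f = t, float("inf")
    for i in range(n):
        h = t - span + 2.0 * span * i / (n - 1)
        f = quadratic_cost(h, gamma, l, u) + (h - t) ** 2 / (2.0 * x)
        if f < best_f:
            best_h, best_f = h, f
    return best_h
```

For example, with the illustrative values $\gamma = 1$, $x = 0.5$, $l = -0.1$, $u = 0.5$ the two functions agree to within the grid spacing for values of $t$ away from the jump points $h_1$ and $h_4$.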



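From (19) and (A.36), the RS $\zeta$-dependent distribution of the local fields becomes

$$
\begin{aligned}
P(h|\zeta) ={}& \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{h^2}{2}\right)\left[\theta(h_1-h)+\theta(h-l)-\theta(h-u)+\theta(h-h_4)\right] \\
&+ \frac{1+2\gamma^2 x}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left((1+2\gamma^2 x)h-2\gamma^2 x\, l\right)^2\right]\left[\theta(h-h_2)-\theta(h-l)\right] \\
&+ \frac{1+2\gamma^2 x}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left((1+2\gamma^2 x)h-2\gamma^2 x\, u\right)^2\right]\left[\theta(h-u)-\theta(h-h_3)\right]
\end{aligned}
\tag{A.38}
$$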



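Consequently, the RS $\zeta$-dependent output error becomes

$$
\begin{aligned}
E(\zeta) = \gamma\Bigg\{ &\left(l+\frac{1}{\gamma}\right)\int_{-\infty}^{h_1}\frac{dh}{\sqrt{2\pi}}\exp\left(-\frac{h^2}{2}\right) \\
&+ (1+2\gamma^2 x)\int_{h_2}^{l}\frac{dh}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left((1+2\gamma^2 x)h-2\gamma^2 x\, l\right)^2\right](l-h) \\
&+ (1+2\gamma^2 x)\int_{u}^{h_3}\frac{dh}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left((1+2\gamma^2 x)h-2\gamma^2 x\, u\right)^2\right](h-u) \\
&+ \left(\frac{1}{\gamma}-u\right)\int_{h_4}^{\infty}\frac{dh}{\sqrt{2\pi}}\exp\left(-\frac{h^2}{2}\right)\Bigg\}
\end{aligned}
\tag{A.39}
$$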
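As a consistency check on the structure of (A.39), one can compare the integral over the local field $h$ with a direct Gaussian average over $t$ of the output error evaluated at $h_0(t)$. The sketch below is illustrative and assumes that the output error is the linear distance of the piecewise-linear output from the tolerance band $[\gamma l, \gamma u]$, that $h_0(t)$ has the piecewise form of (A.36), and that $-1/\gamma < l \le u < 1/\gamma$. The helper names and the midpoint quadrature are choices made here, not the authors' procedure.

```python
import math

def gauss(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def midpoint(f, a, b, n=20000):
    """Composite midpoint rule on [a, b]; returns 0 for an empty interval."""
    if b <= a:
        return 0.0
    w = (b - a) / n
    return w * sum(f(a + (i + 0.5) * w) for i in range(n))

def thresholds(gamma, x, l, u):
    """h1, h2, h3, h4 as in (A.32)-(A.35)."""
    s = math.sqrt(1.0 + 2.0 * gamma**2 * x)
    return (l - s * (l + 1.0 / gamma), l - (l + 1.0 / gamma) / s,
            u + (1.0 / gamma - u) / s, u + s * (1.0 / gamma - u))

def rs_error_field_integral(gamma, x, l, u):
    """Output error written as an integral over the local field h."""
    a, c = 1.0 + 2.0 * gamma**2 * x, 2.0 * gamma**2 * x
    h1, h2, h3, h4 = thresholds(gamma, x, l, u)
    sat_low = (l + 1.0 / gamma) * 0.5 * math.erfc(-h1 / math.sqrt(2.0))
    lin_low = a * midpoint(lambda h: (l - h) * gauss(a * h - c * l), h2, l)
    lin_up = a * midpoint(lambda h: (h - u) * gauss(a * h - c * u), u, h3)
    sat_up = (1.0 / gamma - u) * 0.5 * math.erfc(h4 / math.sqrt(2.0))
    return gamma * (sat_low + lin_low + lin_up + sat_up)

def rs_error_direct(gamma, x, l, u, cutoff=10.0, n=100000):
    """Same quantity as a Gaussian average over t of the error at h0(t)."""
    a, c = 1.0 + 2.0 * gamma**2 * x, 2.0 * gamma**2 * x
    h1, _, _, h4 = thresholds(gamma, x, l, u)

    def h0(t):
        if h1 < t < l:
            return (c * l + t) / a
        if u < t < h4:
            return (c * u + t) / a
        return t

    def output_error(h):
        out = max(-1.0, min(1.0, gamma * h))
        return max(gamma * l - out, 0.0) + max(out - gamma * u, 0.0)

    return midpoint(lambda t: output_error(h0(t)) * gauss(t), -cutoff, cutoff, n)
```

Both routes should agree up to quadrature error, since the field integral is just the change of variables $t \to h_0(t)$ applied branch by branch.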
For the hyperbolic tangent input-output relation, $h_\zeta$ (recall eq. (11)) is no longer a local minimum of (13). The minima are always given by the solutions of (A.24), that define $t(h_0)$ as

$$
t(h_0) = h_0 + 2\gamma x\, g'(\gamma h_0)\left[g(\gamma h_0) - \zeta\right].
\tag{A.40}
$$



Depending on the values of $\gamma^2 x$ and $\zeta$, $t(h_0)$ is a monotonic function or not. The onset of non-monotonicity is given by the system of equations (A.26). If monotonicity holds, $h_0(t)$ is continuous, and is obtained by inverting (A.40). Otherwise, $h_0(t)$ displays one or two jumps at $t = t_1$ and/or $t = t_2$, whose values are obtained by a Maxwell construction.
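In the monotonic case the inversion of $t(h_0)$ has to be done numerically. A minimal sketch, assuming $g = \tanh$ and $t(h_0) = h_0 + 2\gamma x\, g'(\gamma h_0)[g(\gamma h_0)-\zeta]$, is a bisection on a bracket that follows from $|t(h_0) - h_0| \le 2\gamma x (1+|\zeta|)$. The function names are hypothetical, and the Maxwell construction needed in the non-monotonic case is not implemented here.

```python
import math

def t_of_h(h, gamma, x, zeta):
    """t(h0) for the quadratic cost function with g = tanh."""
    s = math.tanh(gamma * h)
    return h + 2.0 * gamma * x * (1.0 - s * s) * (s - zeta)

def slope(h, gamma, x, zeta):
    """dt/dh0 = 1 + 2 gamma^2 x [g'^2 + g'' (g - zeta)]."""
    s = math.tanh(gamma * h)
    g1 = 1.0 - s * s          # g'(y)  = sech^2(y)
    g2 = -2.0 * s * g1        # g''(y) = -2 tanh(y) sech^2(y)
    return 1.0 + 2.0 * gamma**2 * x * (g1 * g1 + g2 * (s - zeta))

def is_monotonic(gamma, x, zeta, span=10.0, n=20000):
    """Grid check that t(h0) is increasing on [-span/gamma, span/gamma]."""
    width = 2.0 * span / gamma
    return all(slope(-span / gamma + width * (i + 0.5) / n, gamma, x, zeta) > 0.0
               for i in range(n))

def h0_of_t(t, gamma, x, zeta, iterations=200):
    """Invert t(h0) by bisection; valid only when t(h0) is monotonic."""
    bound = 2.0 * gamma * x * (1.0 + abs(zeta))  # |t(h) - h| <= bound
    lo, hi = t - bound, t + bound
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if t_of_h(mid, gamma, x, zeta) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A round trip `h0_of_t(t_of_h(h, ...), ...)` should return `h` whenever `is_monotonic` holds for the chosen parameters.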

The distribution of local fields is obtained directly from (19):

$$
P(h|\zeta) = \frac{dt}{dh}\,\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{t^2(h)}{2}\right]
\tag{A.41}
$$



When non-monotonicity holds, the jumps in $h_0(t)$ give rise to a gap structure in the distribution of the local fields and the RS solution becomes unstable. In the monotonic case the stability condition (15) reads
+00 / 1 



Dt 



+ 1 



< 1. 



{C} 



(A.42) 

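Let us finally turn to the RSB1 treatment. Again, in order to obtain $h_0(\zeta,x,q_0,t_0,t_1)$ and $F_{RSB1}(h_0,\zeta,x,q_0,t_0,t_1)$ we substitute $t$ by $t_0\sqrt{q_0} + t_1\sqrt{1-q_0}$ in (A.14). The free energy is then obtained with the function $\Psi(\zeta,x,q_0,M,t_0)$, given by (24), which becomes

$$
\begin{aligned}
\Psi(\zeta,x,q_0,M,t_0) ={}& \exp\left[-M\gamma^2 x\left(l+\frac{1}{\gamma}\right)^2\right]\int_{-\infty}^{\Omega(h_1,q_0,t_0)} Dt_1 \\
&+ \int_{\Omega(h_1,q_0,t_0)}^{\Omega(l,q_0,t_0)} Dt_1 \exp\left[-\frac{M\gamma^2 x}{1+2\gamma^2 x}\left(l-t_0\sqrt{q_0}-t_1\sqrt{1-q_0}\right)^2\right] + \int_{\Omega(l,q_0,t_0)}^{\Omega(u,q_0,t_0)} Dt_1 \\
&+ \int_{\Omega(u,q_0,t_0)}^{\Omega(h_4,q_0,t_0)} Dt_1 \exp\left[-\frac{M\gamma^2 x}{1+2\gamma^2 x}\left(t_0\sqrt{q_0}+t_1\sqrt{1-q_0}-u\right)^2\right] \\
&+ \exp\left[-M\gamma^2 x\left(u-\frac{1}{\gamma}\right)^2\right]\int_{\Omega(h_4,q_0,t_0)}^{\infty} Dt_1
\end{aligned}
\tag{A.43}
$$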
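For the RSB1 $\zeta$-dependent distribution of the local fields we obtain

$$
\begin{aligned}
P(h|\zeta) = \int Dt_0\, \frac{1}{\Psi(\zeta,x,q_0,M,t_0)} \Bigg\{ &\frac{\exp\left[-M\gamma^2 x\left(l+\frac{1}{\gamma}\right)^2-\frac{1}{2}\Omega^2(h,q_0,t_0)\right]}{\sqrt{2\pi(1-q_0)}}\,\theta(h_1-h) \\
&+ (1+2\gamma^2 x)\,\frac{\exp\left[-M\gamma^2 x(1+2\gamma^2 x)(h-l)^2-\frac{1}{2}\Omega^2\left((1+2\gamma^2 x)(h-l)+l,q_0,t_0\right)\right]}{\sqrt{2\pi(1-q_0)}}\,\left[\theta(h-h_2)-\theta(h-l)\right] \\
&+ \frac{\exp\left[-\frac{1}{2}\Omega^2(h,q_0,t_0)\right]}{\sqrt{2\pi(1-q_0)}}\,\left[\theta(h-l)-\theta(h-u)\right] \\
&+ \frac{\exp\left[-M\gamma^2 x\left(\frac{1}{\gamma}-u\right)^2-\frac{1}{2}\Omega^2(h,q_0,t_0)\right]}{\sqrt{2\pi(1-q_0)}}\,\theta(h-h_4) \\
&+ (1+2\gamma^2 x)\,\frac{\exp\left[-M\gamma^2 x(1+2\gamma^2 x)(h-u)^2-\frac{1}{2}\Omega^2\left((1+2\gamma^2 x)(h-u)+u,q_0,t_0\right)\right]}{\sqrt{2\pi(1-q_0)}}\,\left[\theta(h-u)-\theta(h-h_3)\right]\Bigg\}
\end{aligned}
\tag{A.44}
$$

The $\zeta$-dependent RSB1 average output error becomes

$$
\begin{aligned}
E(\zeta) = \int Dt_0\, \frac{\gamma}{\Psi(\zeta,x,q_0,M,t_0)\sqrt{2\pi(1-q_0)}} \times \Bigg\{ &\left(l+\frac{1}{\gamma}\right)\exp\left[-M\gamma^2 x\left(l+\frac{1}{\gamma}\right)^2\right]\int_{-\infty}^{h_1} dh\, \exp\left[-\frac{1}{2}\Omega^2(h,q_0,t_0)\right] \\
&+ (1+2\gamma^2 x)\int_{h_2}^{l} dh\,(l-h)\exp\left[-\frac{1}{2}\Omega^2\left((1+2\gamma^2 x)(h-l)+l,q_0,t_0\right)-M\gamma^2 x(1+2\gamma^2 x)(h-l)^2\right] \\
&+ (1+2\gamma^2 x)\int_{u}^{h_3} dh\,(h-u)\exp\left[-\frac{1}{2}\Omega^2\left((1+2\gamma^2 x)(h-u)+u,q_0,t_0\right)-M\gamma^2 x(1+2\gamma^2 x)(h-u)^2\right] \\
&+ \left(\frac{1}{\gamma}-u\right)\exp\left[-M\gamma^2 x\left(\frac{1}{\gamma}-u\right)^2\right]\int_{h_4}^{\infty} dh\, \exp\left[-\frac{1}{2}\Omega^2(h,q_0,t_0)\right]\Bigg\}
\end{aligned}
\tag{A.45}
$$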
Figure Captions

Figure 1: Storage capacity $\alpha$ for the hyperbolic tangent input-output relation as a function of the gain parameter $\gamma$ at constant total average output error $E = 0$ (lower curve), 0.1, 0.2 and 0.4 (upper curve) for the GD cost function (a), the linear cost function (b) and the quadratic cost function (c). In (b) and (c) the line for $E = 0.4$ is not shown. The dotted curve is the AT-line.

Figure 2: The gap structure for the hyperbolic tangent input-output relation in the $\alpha$-$\gamma$ plane for the linear cost function (a) and the quadratic cost function (b). The curve $\alpha_c$ is the critical capacity, the curve $\alpha_g$ represents the gap line and $\alpha_{AT}$ is the AT-line.

Figure 3: The total average output error $E$ as a function of the storage capacity $\alpha$ for output tolerance $\epsilon = 0.5$ and gain parameter $\gamma = 1$ in the case of the piecewise-linear input-output relation for the GD cost function (solid lines), the linear cost function (dashed curves) and the quadratic cost function (dotted lines). For each case the upper curve is the RSB1 result, the lower one the RS result.

Figure 4: Storage capacity $\alpha$ for the piecewise-linear input-output relation as a function of the gain parameter $\gamma$ for an output tolerance $\epsilon = 0.5$ and a total average output error $E = 0.05$ for the GD cost function (solid lines), the linear cost function (dashed curves) and the quadratic cost function (dotted curves). For each case the upper curve is the RS result, the lower one the RSB1 result. The dashed-dotted curve is the critical storage capacity.

Figure 5: Distribution of the local fields for the piecewise-linear input-output relation and $\alpha = 3$, $\gamma = 1$, $\epsilon = 0.5$ and the correct output $\zeta = 0.6$ for the GD cost function (a), the linear cost function (b) and the quadratic cost function (c). In (c) the results on the interval $h \in [-3, -1]$ are magnified by a factor 20. The dotted curves are the RS results, the solid lines the RSB1 results.